PSET5 - Speller

MOC Programming

Created: 2022-07-04
Tags: #fleeting


Abstract:


Theoretically, on input of size n, an algorithm with a running time of n is “asymptotically equivalent,” in terms of O, to an algorithm with a running time of 3n.
Indeed, when describing the running time of an algorithm, we typically focus on the most impactful term (i.e., n in this case, since n could be much larger than 2).

In the real world, though, the fact of the matter is that 2n feels twice as slow as n.

The challenge ahead of you is to

  • implement the fastest spell checker you can!
    By “fastest,” though, we’re talking actual “wall-clock,” not asymptotic, time.

Files

speller.c

  • spell-check a file after loading a dictionary of words in dictionary.c from disk into memory

dictionary.h

  • Prototypes for functions meanwhile, are defined in dictionary.h.
  • That way, both speller.c and dictionary.c can #include the file.

Preprocessor Directive
#define LENGTH 45
defines a “constant” called LENGTH that has a value of 45.
Constant in the sense that you can’t change it in your own code.
clang will replace mentions of LENGTH in the code with, literally, 45.

bool check(const char *word);
unsigned int hash(const char *word);
bool load(const char *dictionary);

char * is what we used to call string.
So those three functions above are essentially just:

bool check(const string word);
unsigned int hash(const string word);
bool load(const string dictionary);

And const, makes it so that strings passed will not get changed
you won’t be able to change them accidentally or smtnh

speller.c

$ ./speller text

will be equivalent to running

$ ./speller dictionaries/large text

text is the file you wish to spell-check. Suffice it to say, the former is easier to type! (Of course, speller will not be able to load any dictionaries until you implement load in dictionary.c! Until then, you’ll see Could not load.)

TIME IN load represents the number of seconds that speller spends executing your implementation of load
TIME IN check represents the number of seconds that speller spends, in total, executing your implementation of check
TIME IN size represents the number of seconds that speller spends executing your implementation of size
TIME IN unload represents the number of seconds that speller spends executing your implementation of unload
TIME IN TOTAL is the sum of those four measurements.

“misspelled” simply mean

  • some word is not in the dictionary provided.

Alright, the challenge now before you is to
implement, in order, 
load
hash
size
check
unload 
as efficiently as possible using a hash table
in such a way that TIME IN loadTIME IN checkTIME IN size, and TIME IN unload are all minimized.

dictionary.c

global pointer array, table
hash table = struct called node
Array contains N node pointers, N26  to match w/ default hash function.
You'll likely increase N depending on your own implementation of hash

We’ve implemented loadchecksize, and unload, but only barely, just enough for the code to compile.
We’ve implemented hash with a sample algorithm based on the first letter of the word.
-> Your job, ultimately, is to re-implement those functions as cleverly as possible so that this spell checker works fast!

Do NOT ALTER

  • speller.c
  • makefile
  • the declarations (i.e., prototypes) of loadhashsizecheck, or unload.

Can Alter

  • N value in dictionary.c, so that your hash table can have more buckets.

Can add functions

  • dictionary.c. (and, in fact, must in order to complete the implementations of loadhashsizecheck, and unload)

Functions in dictionary.c

load

  • loads all of the words in dictionary
  • into some sort of data structure (hash table)
    Hash Table
    -> is an array of individual linked list

Hash Function
-> assigns a number to every point

Steps
-> Open dictionary file
fopen()
make sure its value NULL

-> Read strings from file one at a time
fscanf(file, "%s", word)
fscanf returns EOF once it reaches End Of File

-> Create a new node for each word
Store each of those words in the hash table

Use malloc to each word to store each node
Use this strcpy("n->word", "Hello") means Copy hello to n->word

-> Hash word to obtain a hash value (the index)
Function takes string and returns an Index

-> Insert node into hash table at that location
Set the new node to point to the first position in linked list
Then you can point the first position in linked list to point at the new node

check

  • must be case-insensitive -> if hello in dictionary, then check returns true for all cases HELLO, hElLO, heLLO, heLlO
  • Note: All words in dictionary file are in lowercase format
  • any variations of case will not be considered misspelled
  • Possessives only return true when they are in the dictionary -> foo is in dictionarycheck should return false given foo's if foo's is not also in dictionary.
  • assume that check will only be passed words that contain (uppercase or lowercase) alphabetical characters and possibly apostrophes.
  • return true if word is in dictionary
  • Traverse linked list looking for word (strcasecmp is a function that compares strings but ignores casing)

hash

  • take a word and run a hash function on it
  • returning some number that corresponds to that word

First one letter is good, but it's not fast
First two letters
or using math to use all letters

With three letters
Some letters in 3rd position are missing (Meaning A B ?? D E F G)
Okay, so instead of getting each letter. We could use for loop to get the letters
When we use for loop, do we ignore the apostrophe????
AB =
BA =

I do not understand what the end result of my hash function will be

I have no idea how many buckets my hash function will get

size

  • returns how many words are in your dictionary
  • you could imagine that as you're loading the hash table
  • you could somehow keep track of the number of words you've added to dictionary so far

unload

  • any memory allocated to store data in data structure
  • will be free up in here

Dictionary File

  • Default dictionary has 143,091 words
  • From top to bottom, the file is sorted a-z,
  • with only one word per line (each of which ends with \n)
  • No word is longer than 45 characters,
  • No word appears more than once.
  • All words in the file are in lowercase format

The hash function you write should ultimately be your own, not one you search for online.

References